This research presents a comprehensive deep learning-based framework for automatic forensic speaker recognition, designed to address the challenges commonly encountered in speaker identification tasks. The proposed system employs Convolutional Neural Networks (CNNs) trained on mel spectrogram representations of speech signals to capture the vocal attributes distinctive to each individual. A dataset of speech recordings from 20 speakers, equally divided between male and female participants, was preprocessed into 3-second mel spectrogram segments to ensure consistent analysis and robust feature extraction.
The CNN model was optimized to perform speaker classification and to generate discriminative speaker embeddings that effectively represent vocal identity. Performance evaluation demonstrated high classification accuracy and strong generalization across varying acoustic conditions. To assess the forensic reliability of the learned representations, the embedding distributions were analyzed using t-SNE visualization. The resulting plots revealed well-defined speaker clusters with low intra-speaker variation, confirming the model's capability to differentiate between distinct voices.
Overall, the outcomes highlight the effectiveness and interpretability of the proposed CNN-based approach for forensic speaker recognition. This framework holds significant potential for real-world forensic applications, including voice evidence authentication, suspect identification, and speaker profiling. Furthermore, it provides a foundation for future research in developing more resilient and transparent models for forensic speech processing.
Introduction
Speaker recognition plays a crucial role in forensic investigations by enabling the identification or verification of individuals from voice evidence such as phone calls and surveillance recordings. However, forensic speaker recognition is challenging due to intra-speaker variability (changes in a speaker’s voice caused by emotion, health, or speaking style), inter-speaker similarity, and poor audio quality in real-world recordings.
This study proposes a CNN-based speaker recognition framework that uses mel spectrograms as input features to address these challenges. Mel spectrograms provide a perceptually meaningful time–frequency representation of speech, allowing CNNs to learn robust and discriminative speaker embeddings. Beyond classification accuracy, the framework emphasizes forensic interpretability by analyzing inter- and intra-speaker variability in the learned embedding space and visualizing speaker separability using t-SNE.
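To make this input representation concrete, the sketch below shows one way the preprocessing described here could be implemented with the open-source librosa library. The fixed 3-second segment length and normalization follow the text; the 16 kHz sampling rate and 128 mel bands are illustrative assumptions, as the paper does not specify them.

```python
import librosa
import numpy as np

def mel_segments(path, sr=16000, seg_sec=3.0, n_mels=128):
    """Cut a recording into fixed 3-second windows and convert each
    window to a normalized log-mel spectrogram segment."""
    y, _ = librosa.load(path, sr=sr)           # resample to a common rate
    seg_len = int(seg_sec * sr)
    segments = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        window = y[start:start + seg_len]
        mel = librosa.feature.melspectrogram(y=window, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel, ref=np.max)
        # Per-segment mean/variance normalization for a consistent input scale
        log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
        segments.append(log_mel)
    return np.stack(segments) if segments else np.empty((0, n_mels, 0))
    # returned shape: (n_segments, n_mels, time_frames)
```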
The literature review outlines the evolution from classical approaches such as GMM-UBM and i-vectors to modern deep learning methods, including x-vectors, CNNs, and LSTMs. While deep learning models show improved robustness under noisy and short-utterance conditions, limited attention has been given to embedding interpretability and variability analysis in forensic contexts. This research addresses that gap by combining high-performance CNN-based classification with embedding visualization and variability analysis.
The methodology uses a controlled speech database consisting of 1,000 recordings from 20 speakers (10 male and 10 female), recorded under standardized conditions. Preprocessing involves extracting normalized mel spectrograms with fixed dimensions. A CNN architecture processes these spectrograms to generate 128-dimensional speaker embeddings, which are then classified using a fully connected layer. Model performance and speaker variability are evaluated through classification results and visual analysis of embeddings, demonstrating the framework’s effectiveness and forensic relevance.
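A minimal sketch of such an architecture is shown below, written in PyTorch as an assumed framework. The 20-speaker output, the 128-dimensional embedding layer, and the fully connected classification head follow the description above; the convolutional block sizes are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpeakerCNN(nn.Module):
    """Maps a log-mel spectrogram to a 128-dimensional speaker embedding,
    then classifies it over the 20 enrolled speakers."""
    def __init__(self, n_speakers=20, emb_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling -> fixed-length vector
        )
        self.embedding = nn.Linear(128, emb_dim)          # 128-d speaker embedding
        self.classifier = nn.Linear(emb_dim, n_speakers)  # fully connected head

    def forward(self, x):
        # x: (batch, 1, n_mels, time_frames)
        h = self.features(x).flatten(1)
        emb = self.embedding(h)     # reused later for variability analysis
        return self.classifier(emb), emb

# Example: four dummy 3-second segments with 128 mel bands
model = SpeakerCNN()
logits, embeddings = model(torch.randn(4, 1, 128, 94))
```

Returning both the logits and the embedding lets a single forward pass serve classification and the downstream embedding analysis.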
Conclusion
This research presents a deep learning-based framework for forensic speaker recognition, utilizing CNNs and mel spectrograms to effectively classify speakers and extract meaningful speaker embeddings. The system demonstrates high classification accuracy and robust performance in representing individual vocal characteristics, making it a promising approach for forensic applications. A key strength of the framework is its ability to capture and visualize intra- and inter-speaker variability through embedding analysis and t-SNE projection. The results confirm the model's potential to support speaker identification and differentiation, which are essential in forensic investigations.
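As an illustration of the t-SNE projection step referred to above, the following sketch (using scikit-learn's TSNE and matplotlib; the perplexity and other settings are illustrative, not taken from the paper) plots embeddings coloured by speaker so that intra- and inter-speaker variability can be inspected visually.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(embeddings, speaker_ids):
    """Project speaker embeddings (NumPy array, n_samples x 128) to 2-D
    with t-SNE and colour each point by its numeric speaker ID."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)
    plt.figure(figsize=(6, 5))
    sc = plt.scatter(coords[:, 0], coords[:, 1], c=speaker_ids,
                     cmap="tab20", s=12)
    plt.colorbar(sc, label="speaker ID")
    plt.title("t-SNE projection of speaker embeddings")
    plt.tight_layout()
    plt.show()
```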
References
[1] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., ... & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97.
[2] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5329-5333). IEEE.
[3] Nagrani, A., Chung, J. S., & Zisserman, A. (2017). VoxCeleb: A large-scale speaker identification dataset. In INTERSPEECH 2017 (pp. 2616-2620).
[4] Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.
[5] Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3), 19-41.
[6] Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2010). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788-798.
[7] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
[8] O'Shaughnessy, D. (2000). Speech communications: Human and machine. IEEE Press.
[9] Fougeron, C. (2022). Intra-speaker phonetic variation in read speech: Comparison with inter-speaker variability in a controlled population. In INTERSPEECH 2022. DOI: 10.21437/Interspeech.2022-10965.
[10] Bunrit, S. (2019). Text-independent speaker identification using deep learning model of convolution neural network. International Journal of Machine Learning and Computing, 9(2), 143-148.
[11] Dellwo, V., Leemann, A., & Kolly, M.-J. (2014). The recognition of read and spontaneous speech in local vernacular: The case of Zurich German. Journal of Phonetics.
[12] Drygajlo, A. (2012). Automatic speaker recognition for forensic case assessment and interpretation. In Forensic Speaker Recognition.
[13] Thiruvaran, T., Ambikairajah, E., & Epps, J. (2008). FM features for automatic forensic speaker recognition. ISCA.
[14] Khan, L. A., Baig, M. S., & Youssef, A. M. (2009). Speaker recognition from encrypted VoIP communications. Elsevier.
[15] Marchetto, E., Avanzini, F., & Flego, F. (2009). An automatic speaker recognition system for intelligence applications. EURASIP.
[16] Malik, S., & Afsar, F. A. (2009). Wavelet transform based automatic speaker recognition. IEEE.
[17] Campbell, J. P., & Shen, W. (2009). Forensic speaker recognition.
[18] Kinnunen, T., & Li, H. (2009). An overview of text-independent speaker recognition: From features to supervectors.
[19] Tiwari, V. (2010). MFCC and its applications in speaker recognition.
[20] Sumithra, M. G., Thanuskodi, K., & Archana, A. H. J. (2011). A new speaker recognition system with combined feature extraction techniques. Journal of Computer Science.
[21] Mandasari, M. I., McLaren, M., & van Leeuwen, D. A. (2012). The effect of noise on modern automatic speaker recognition systems. IEEE.
[22] Beigi, H. (2012). Speaker recognition: Advancements and challenges.
[23] Parul, & Dubey, R. B. (2012). Automatic speaker recognition system.
[24] Selvan, K., Joseph, A., & Babu, A. K. K. (2013). Speaker recognition system for security application. In IEEE Recent Advances in Intelligent Computational Systems.
[25] Abdualrahiman, N., & Ranju, K. V. (2013). Text dependent speaker recognition.
[26] Ghahabi, O., & Hernando, J. (2014). Deep belief networks for i-vector based speaker recognition. IEEE.
[27] Variani, E., Lei, X., McDermott, E., Lopez Moreno, I., & Gonzalez-Dominguez, J. (2014). Deep neural networks for small footprint text-dependent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[28] Richardson, F. (2015). Deep neural network approaches to speaker and language recognition. IEEE.
[29] Admuth, S. S., & Ghugardare, S. (2015). Survey paper on automatic speaker recognition systems.